Performance and scalability in an intelligent data operating layer system

ABSTRACT

Systems and methods are described that allow for an intelligence platform for distributed processing of big data sets including both structured and unstructured data types across two or more intelligent data operation engine servers. The intelligent data operation engine servers can form a conceptual understanding of content in each electronic file and then cooperate with a distributed index handler to index the conceptual understanding of the electronic file. A query pipeline and the distributed index handler in the intelligence platform cooperate with the two or more intelligent data operation engine servers to improve scalability and performance on the big data sets containing both structured and unstructured electronic files represented in the common index.

BACKGROUND

Big data technologies describe a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by, for example, enabling high-velocity capture, discovery, and/or analysis. In short, big data technology can help to extract value from the digital universe. Big data comes in one size: large. Systems that attempt to process such amounts of data will be awash with data, easily amassing terabytes and even petabytes of information. In information technology, the big data sets are so large and complex that they become awkward to process using relational databases and standard management tools.

BRIEF DESCRIPTION OF THE DRAWINGS

The multiple drawings refer to embodiments of the disclosure. While embodiments of the disclosure described herein are subject to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail.

FIG. 1 illustrates an example of a network diagram of an embodiment of an intelligence platform for processing big data sets of unstructured data as well as structured data that includes intelligent data operation engine servers, one or more distributed index handlers, and one or more instances of action handlers.

FIG. 2 illustrates an example diagram of an embodiment of an enterprise service bus.

FIG. 3 illustrates an example block diagram of an embodiment of some modules of an intelligent data operation engine server.

FIG. 4 illustrates an example block diagram of an embodiment of an intelligent data operation engine server generating a conceptual representation of an electronic file.

FIGS. 5A-5B illustrate an example flow diagram of a process to create one or more conceptual understandings/representations of electronic files.

FIG. 6 illustrates an example flow diagram of an embodiment of a process.

FIG. 7 illustrates an example block diagram of an embodiment of an intelligent data operation engine with a speech recognition engine portion.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth, such as examples of specific data signals, components, connections, etc. in order to provide a thorough understanding of the present disclosure. It will be apparent, however, to one skilled in the art that the present disclosure may be practiced without these specific details. In other instances, well known components or methods have not been described in detail but rather in a block diagram in order to avoid unnecessarily obscuring the present disclosure. Thus, the specific details set forth are merely exemplary. The specific details may be varied from and still be contemplated to be within the spirit and scope of the present disclosure.

In the following description of exemplary embodiments, reference is made to the accompanying drawings that form a part hereof, and in which it is shown by way of illustration specific embodiments in which the disclosure can be practiced. It is to be understood that other embodiments can be used and structural changes can be made without departing from the scope of the embodiments of this disclosure. As used herein, the terms “couple,” “connect,” and “attach” are interchangeable and include various forms of connecting one part to another either directly or indirectly. Also, it should be appreciated that one or more structural features described in one embodiment could be implemented in a different embodiment, even if not specifically mentioned as being a feature thereof.

In general, systems and methods are discussed that allow for an intelligence platform configured for the processing of a large set of both unstructured data as well as structured data. The intelligence platform includes a series of modular and distributed servers configured for distributed processing of big data sets including both structured and unstructured information types across two or more intelligent data operation engine servers. The intelligence platform further includes two or more similar instances of an intelligent data operation engine server. Each intelligent data operation engine server may be configured to apply automatic advanced analytics, categorization, and clustering to all appropriate types of data including the structured and unstructured data. The intelligent data operation engine servers can form a conceptual understanding of content in electronic files and then cooperate with a distributed index handler to index the conceptual understanding of the electronic file. The intelligence platform further includes an index processing pipeline containing one or more instances of the distributed index handler. Each instance of the distributed index handler is configured to split and index data into the two or more instances of the intelligent data operation engine servers, optimize performance by batching data, replicate all index commands, and invoke dynamic load distribution. The intelligence platform can include a query pipeline with one or more instances of an action handler that allows for the distribution of protocol action commands between the two or more intelligent data operation engine servers. The query pipeline and the index processing pipeline cooperate with the two or more intelligent data operation engine servers to improve scalability and performance on the big sets of data containing both structured and unstructured electronic files.

Overview

On the server side, the servers take advantage of distributed computing among the set of servers, an intelligence engine contextually making a nexus between various concepts, and mirroring of hardware and content to achieve near real time analysis of big data sets and to be able to take an informed action based on the analysis.

Server Mirroring and Distributed Processing

FIG. 1 illustrates an example of a network diagram of an embodiment of an intelligence platform for processing big data sets of unstructured data as well as structured data that includes intelligent data operation engine servers, one or more distributed index handlers, and one or more instances of action handlers. In this example, one server computer for simplicity represents each server site. However, it should be understood that each server site may include multiple server computers working together collaboratively in a distributed manner as described above. Server computers 105M, 105A, 105B and 105C connected to the network 100 may be configured as intelligent data operating layer servers (intelligent data operation engine servers). The intelligent data operation engine server instances may include a main intelligent data operation engine server 105M and multiple mirrored intelligent data operation engine servers 105A-105C. The main intelligent data operation engine server 105M may mirror its information onto the mirrored intelligent data operation engine servers 105A-105C. The mirroring may include mirroring the content of the main intelligent data operation engine server database 106M into the mirrored intelligent data operation engine server databases 106A-106C. The main intelligent data operation engine server 105M and the mirrored intelligent data operation engine servers 105A-105C may be located or distributed in various geographical locations to serve the enterprise facilities in these areas. For example, the main intelligent data operation engine server 105M may be located in Paris, the mirrored intelligent data operation engine server 105A may be located in Boston, 105B in Philadelphia, and 105C in New York. Additionally, multiple instances of the intelligent data operation engine server 105A, 105B, and 105C may be located at the same geographic location with the instances configured to cooperate to speed up query response time and to scale to handle massive data analysis. As discussed, the mirroring of a server computer in one location with another server computer in another location may be understood as the mirroring of a server site with all of its server computers together with associated hardware and content.

The action handler can be used to query a cluster of intelligent data operation engine servers. The action handler distributes action commands amongst the two or more intelligent data operation engine servers, increasing the speed with which actions are executed and saving processing time. The action handler monitors activity of each intelligent data operation engine server and load balances between the two or more intelligent data operation engine servers. The action handler also distributes actions when a lack of feedback from a particular intelligent data operation engine server occurs to ensure uninterrupted service if any of the intelligent data operation engine servers should fail.
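The load balancing and failover behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class names `EngineServer` and `ActionHandler`, the `dispatch` method, and the "fewest actions handled so far" load measure are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EngineServer:
    """Hypothetical stand-in for one intelligent data operation engine server."""
    name: str
    healthy: bool = True
    handled: list = field(default_factory=list)

    def execute(self, action: str) -> str:
        if not self.healthy:
            # Models the "lack of feedback" case from the description.
            raise ConnectionError(f"{self.name} did not respond")
        self.handled.append(action)
        return f"{self.name}:{action}"


class ActionHandler:
    """Sends each action to the least-loaded server, skipping failed ones."""

    def __init__(self, servers):
        self.servers = list(servers)

    def dispatch(self, action: str) -> str:
        # Assumed load measure: number of actions a server has handled so far.
        for server in sorted(self.servers, key=lambda s: len(s.handled)):
            try:
                return server.execute(action)
            except ConnectionError:
                continue  # fail over to the next candidate
        raise RuntimeError("no engine server responded")


servers = [
    EngineServer("105A"),
    EngineServer("105B", healthy=False),  # simulated failed server
    EngineServer("105C"),
]
handler = ActionHandler(servers)
results = [handler.dispatch(f"query-{i}") for i in range(4)]
```

In this sketch the failed server never receives work, and the remaining two servers alternate, so service continues without interruption.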

The action handler is user configurable to decide which mode it will run in: mirror mode or non-mirror mode.

Mirror mode: The intelligent data operation engine servers that the action handler distributes command actions in a common protocol to are identical (that is two or more of the intelligent data operation engine servers in a cluster of servers are exact copies of each other, each one is configured the same way and contains the same data).

Non-mirror: The intelligent data operation engine servers that the action handler distributes command actions in a common protocol to are different (that is two or more of the intelligent data operation engine servers in the cluster of servers are configured differently and contain different data). When running the action handler in non-mirror mode, the action handler sets up Virtual Databases that can be of the following types: Combinator and Distributor.

Combinator: The Virtual Database forwards an action command to all the databases that it comprises. The action handler collates and sorts the results before the action handler returns them.

Distributor: The Virtual Database forwards an action command to one of the databases it comprises. These databases must be identical (that is all of the databases are exact copies of each other and contain the same data). The way the action handler forwards the action is determined by the distribution method.
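The two Virtual Database types can be contrasted in a short sketch. All names here are hypothetical: each database is modeled as a plain function returning `(document_id, score)` pairs, sorting by descending score stands in for whatever ordering the action handler applies, and round-robin is only an assumed distribution method, since the description does not name one.

```python
import itertools


def combinator(databases, query):
    """Forward the query to every database, then collate and sort the hits."""
    hits = []
    for db in databases:
        hits.extend(db(query))
    # Assumed ordering: highest relevance score first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def make_distributor(databases):
    """Forward each query to exactly one of several identical databases."""
    rotation = itertools.cycle(databases)  # round-robin is an assumption

    def distribute(query):
        return next(rotation)(query)

    return distribute


# Combinator over two databases holding different data.
def news_db(query):
    return [("news-1", 0.72), ("news-2", 0.41)]


def mail_db(query):
    return [("mail-7", 0.88)]


combined = combinator([news_db, mail_db], "merger")

# Distributor over two identical copies; `calls` records which copy answered.
calls = []


def replica(name):
    def db(query):
        calls.append(name)
        return [("doc-1", 0.5)]
    return db


distribute = make_distributor([replica("copy-a"), replica("copy-b")])
for _ in range(3):
    distribute("merger")
```

The Combinator returns one merged result list drawn from every database, while the Distributor touches only one copy per query, which is why its databases must hold the same data.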

The distributed index handler can be used to create the common index for a cluster of intelligent data operation engine servers. The distributed index handler distributes index commands to the two or more intelligent data operation engine servers, so that index commands are executed more quickly and processing time is saved. The distributed index handler also redistributes index commands when a lack of feedback from a particular intelligent data operation engine server occurs, to ensure uninterrupted service if any of the intelligent data operation engine servers should fail. Connectors index into the distributed index handler.

The distributed index handler is user configurable to decide which mode it will run in: mirror mode or non-mirror mode.

Mirror mode: The distributed index handler distributes all the index data it receives to all the intelligent data operation engine servers that it is connected to. The intelligent data operation engine servers are exact copies of each other that must all be configured in the same way. At least one intelligent data operation engine server and one instance of the distributed index handler should run in mirror mode if the facility wants to ensure uninterrupted service when one of the intelligent data operation engine servers should fail. While one intelligent data operation engine server is inoperable, data continues to be indexed into its identical copies which at the same time are still available to return data for queries.

Non-Mirror mode: The distributed index handler distributes the index data it receives evenly across the designated intelligent data operation engine servers that it is connected to. For example, if the distributed index handler is connected to four intelligent data operation engine servers, it indexes approximately one quarter of the data into each one of the intelligent data operation engine servers (note that individual documents are not split up). Running the distributed index handler in non-mirror mode assists in instances where the amount of data to be indexed is too large for a single intelligent data operation engine server. When the intelligent data operation engine servers that the distributed index handler indexes into are situated on different machines, the index process will require less time.
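The difference between the two indexing modes can be illustrated with a small sketch. The function name `distribute_index` and the round-robin assignment are assumptions made for illustration; the description only states that data is spread evenly and that individual documents are never split.

```python
def distribute_index(documents, server_count, mirror):
    """Return one list of documents per server.

    mirror=True  -> every server receives every document.
    mirror=False -> whole documents are dealt out round-robin, so each
                    server holds roughly len(documents) / server_count.
    """
    if server_count < 1:
        raise ValueError("server_count must be at least 1")
    if mirror:
        return [list(documents) for _ in range(server_count)]
    shards = [[] for _ in range(server_count)]
    for position, document in enumerate(documents):
        # A document goes to exactly one server; it is never split up.
        shards[position % server_count].append(document)
    return shards


docs = [f"doc-{i}" for i in range(10)]
mirrored = distribute_index(docs, 4, mirror=True)
sharded = distribute_index(docs, 4, mirror=False)
shard_sizes = [len(shard) for shard in sharded]
```

With ten documents and four servers, mirror mode stores forty document copies in total, while non-mirror mode stores ten, with each server holding about a quarter of them.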

For some embodiments, a set of two or more intelligent data operation engine servers may work together in a cooperative and distributed manner to do the work of a common query engine. For example, there may be a set of two or more intelligent data operation engine servers in Boston configured to perform the operations of the common query engine. This allows the functionalities of the common query engine amongst the set of server computers to be performed in a faster manner.

The distribution of server computers within a given location or sister location helps to improve the identification of, and response time for, relevant electronic files. The mirroring of sites with identical compositions of hardware and content is done to help improve the identification and response time. In addition, the mirroring of identical server site locations aids in servicing potentially millions of computing devices by distributing the workload and limiting the physical transmission distance and associated time. The intelligent data operation engine server set is duplicated with the same content and mirrored across the Internet to distribute this load to multiple identical sites in order to both improve response time and handle the capacity of the queries by those computing devices.

Next, when the intelligent data operation engine server 105A is overloaded, the computing devices 110A at a given enterprise facility may be connected with the intelligent data operation engine server 105C in New York because it may not be overloaded even though the intelligent data operation engine server 105C may be further from the geographic location of the enterprise facility than the intelligent data operation engine server 105A.

For some embodiments, a set of intelligent data operation engine servers may be used to analyze big data stored in repositories and do data operation actions on that information such as respond to queries, give current trends, etc. A hierarchical set of filters may be spread linearly across the set of intelligent data operation engine servers. These intelligent data operation engine servers may work together in collaboration to process the big data information in order to respond to queries, project trending information, etc. For example, the intelligent data operation engine servers 105A-105C may work together to process the information received from the enterprise facility 110A. The communication channel 151 using a common protocol between the intelligent data operation engine servers 105A and 105C, and the communication channel 152 between the intelligent data operation engine servers 105A and 105B illustrate this collaboration. Similarly, the intelligent data operation engine servers 105C, 105B, and 105A may work together to process the information received from the enterprise facility 110B. The communication channel 151 between the intelligent data operation engine servers 105C and 105A, and the communication channel 153 between the intelligent data operation engine servers 105C and 105B illustrate this collaboration.

The number of intelligent data operation engine servers installed at a facility generally depends on the number of documents and facility users that the intelligent platform will be required to handle. Generally, one intelligent data operation engine server can comfortably analyze, store, and process queries for a set number of documents or users. To achieve an optimal query speed or document processing speed, a number of servers should be combined to scale together with the use of the action handler and distributed index handler. In addition, it is also useful to install multiple intelligent data operation engine servers so one or more intelligent data operation engine servers can be specifically tweaked for certain categories. Generally, response time is minimized when the data is spread across multiple intelligent data operation engine servers, so that each intelligent data operation engine server stores one subject area, or data that can be separated into data that is used frequently and data that is used infrequently. The facility may store each data group on individual intelligent data operation engine servers to speed up the Dynamic Reasoning Engine's (DRE's) response time.

In an example query process, each server in the set of servers applies filters to eliminate irrelevant stored electronic files and/or find corresponding positive matched electronic files in enterprise facilities 110A, 110B as possible matches to feature sets of known objects in the object database. Entire categories of possible matching objects can be eliminated simultaneously, while subsets even within a single category of possible matching objects can be simultaneously solved for on different servers. Each server may hierarchically rule out potentially known electronic files on each machine to narrow down the hierarchical branch.

The intelligent data operation engine server set contains processing power that can be distributed across the intelligent data operation engine server set, and applied to the intelligent data operating layer databases. The collaboration among the intelligent data operation engine servers may help speed up the data analysis process. For example, each of the intelligent data operation engine servers may apply filters to eliminate a certain pattern of features as possible matches to features of known electronic files stored in the database. Entire categories of electronic files may be eliminated simultaneously, while the collaborating intelligent data operation engine servers may simultaneously identify subsets even within a single category of electronic files as potential matching objects. Further, feedback communication occurs between each intelligent data operation engine server to help hierarchically rule out potential known electronic files to narrow down the hierarchical branch and leaf path to determine whether there is a match.

As discussed, the server computer has a set of one or more databases to store a scalable database of electronic files. The intelligent platform of intelligent data operation engine servers, distributed index handlers, and action handlers described herein enables organizations to understand and process big data set information, structured and unstructured, in near real time. The cooperation between the intelligent platform of intelligent data operation engine servers and distributed index handlers makes it possible to aggregate and index any form of structured, semi-structured and unstructured data into a single index, regardless of where the file resides. Users of traditional systems may be required to sequentially search in four or five repositories to arrive at an acceptable results list, whereas the intelligent platform enables a single point of search for all enterprise information (including rich media), saving organizations time and money. With access to virtually every piece of content, the intelligent platform provides a complete view of an organization's data assets.

In various embodiments, the systems and methods described herein might not use a schema. In some cases, a big data set may not have a schema that will work with all of the data that is stored by a user of the system. This data may have many different formats and may include many different types of files. Accordingly, it can be preferable for the system to function without a predefined schema. In an embodiment, the intelligent data operation engine analyzes the content of structured and unstructured data in electronic files, creates conceptual understandings of these electronic files, and creates a search index that resides on a server or sets of servers as an executable application that integrates with the massively parallel processors and multithreaded nature of the hardware in the server. The intelligent data operation engine, action handler, and distributed index handlers may be configured as executable applications or services running alongside other software.

As illustrated in FIG. 1, the servers 106A, 105A-B, can be distributed. For example, in organizations that are geographically distributed, local replicas can be automatically created and utilized where possible. Remote copies may only be used when a local system fails, thereby building fault tolerance whilst maintaining the benefits of local performance and a reduction of resource overhead into a single, seamless service. In other embodiments, documents may be distributed to different parts of the system based on age, subject matter, language, etc.

The system illustrated in FIG. 1 can perform load balancing using a distributed index handler and an action handler. With load balancing, data is automatically replicated across multiple servers and user requests are load-balanced across these replicas, improving performance, reducing latency, and improving user experience. In some embodiments, a search can be directed to a proper portion of a system based on what is being searched for. For example, in one embodiment more resources may be dedicated to recent news articles because more recent news may generally be of greater interest to many users and, accordingly, it may be important to search these stories more quickly and they may be searched by a greater number of people, perhaps performing a greater number of searches. Conversely, fewer resources might be provided for less recent news stories stored in such a system because fewer searches on those stories might be performed.

The system illustrated in FIG. 1 may also include a multi-dimensional index to provide valuable information to the distribution components. An example intelligent data operation engine server precludes bottlenecks and unbalanced peak loads during the indexing and query process. An embodiment can provide prioritized throttling to preclude bottlenecks and unbalanced peak loads. The prioritized throttling includes factors such as: 1) time to maximize index/query performance based on the time of day (i.e. work hours), 2) location such as prioritizing activity based on the server landscape, and 3) status such as an assigned priority status for processing.
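One way to picture how the three throttling factors might combine is a single priority score per task. Everything in this sketch is an illustrative assumption: the task fields, the 9:00 to 17:00 work-hours window, the penalty weights, and the policy of deferring indexing while users are likely to be querying are not taken from the disclosure.

```python
import heapq


def priority(task, hour, local_site):
    """Lower value means processed sooner. All weights are assumptions."""
    score = task["status"]  # assigned priority status, 0 = most urgent
    work_hours = 9 <= hour < 17  # assumed definition of work hours
    if work_hours and task["kind"] == "index":
        score += 10  # time factor: defer indexing during work hours
    if task["site"] != local_site:
        score += 5  # location factor: prefer work for the local site
    return score


def schedule(tasks, hour, local_site):
    """Return task names in the order they would be processed."""
    heap = [
        (priority(task, hour, local_site), position, task["name"])
        for position, task in enumerate(tasks)
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


tasks = [
    {"name": "bulk-index", "kind": "index", "site": "boston", "status": 1},
    {"name": "user-search", "kind": "query", "site": "boston", "status": 2},
    {"name": "remote-search", "kind": "query", "site": "paris", "status": 2},
]
daytime = schedule(tasks, hour=10, local_site="boston")
overnight = schedule(tasks, hour=2, local_site="boston")
```

Under these assumed weights, the bulk indexing job runs last during the day and first overnight, which smooths the peak load without changing its assigned status.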

An intelligent data operation engine server is configured to apply automatic advanced analytics, categorization, and clustering to all types of data including structured and unstructured data, which enables organizations to reduce costs and risks, and collect server data for any purpose including data analytics, efficient discovery, and legal hold. In an embodiment, the intelligent data operation engine server is configured to perform keyword and conceptual searches; speech analytics; and video file, social media content, email and messaging content searches. The intelligent data operation engine server may then categorize all this information, all on the same platform. In an embodiment, an intelligent data operation engine server understands over 1,000 content types from more than 400 repositories in over 150 human languages. Over 500 operations can be performed on digital content by the intelligent data operation engine server, including hyper-linking, clustering, agents, summarization, taxonomy generation, profiling, alerting, and meaning-based search and retrieval. Because the intelligent data operation engine server is at the core of an example system's scalable, modular family of product offerings, any product can be seamlessly integrated with any other system.

Similarly, in various embodiments one or more servers 105M, 105A-105C, may include map and reduce integration such that the intelligent data operating layer supports and leverages the Hadoop® ecosystem, combining the strengths of such a system and the intelligent data operating layer for richer analytic computation. Unlike other vendors who simply re-implement Map/Reduce, the intelligent data operating layer can uniquely leverage additional tools in such systems, including technology stacks such as HBase and Hive. Parallel import and export to HDFS allows data transformation to occur in such example systems or the intelligent data operating layer.

The management of structured and unstructured content requires an intelligent data operations platform, such as an intelligent data operating layer, that meets the most rigorous performance requirements and that can also be easily resized or reconfigured commensurate to business needs. An operator of the server system has an option to place the instances of the intelligent data operation engine servers to run in 1) mirroring mode or 2) non-mirroring mode. The operator may place instances of the intelligent data operation engine servers to run in 1) mirroring mode in order to scale to process massive amounts of queries per unit time because the instances are replicated copies of each other and the processing of the queries is load balanced between the replicated instances of the intelligent data operation engine servers. The operator may place instances of the intelligent data operation engine servers to run in 2) non-mirroring mode to process queries with less latency between a query request and a search result response because the instances are 1) configured differently or 2) work on different sets of data, causing less latency between the query request and the search result response.

An example system scales to support large enterprise and portal deployments around the world, with presence in many markets. The intelligent data operating layer server analyzes larger sets of data, such as Big Data, and returns a manageable set of data back to a user. Since the intelligent data operating layer scalability is based on its modular, distributed architecture, the intelligent data operating layer can handle massive amounts of data on commodity dual-CPU servers. For instance, only a few hundred entry-level enterprise machines may be needed to support a large, i.e. 10 billion record, footprint of big data. The intelligent data operating layer is configured to handle the larger sets of data to create improved performance of the intelligent data operating layer.

In an example embodiment, a single intelligent data operating layer engine can: support an estimated 100 million documents on 32-bit architectures and over 750 million on 64-bit platforms; accurately index in excess of 100 GB/hour with index commit times (i.e. how fast an asset can be queried after it is indexed) of sub 5 ms; execute over 2,500 queries per second, while querying the entire index for relevant information, with sub-second response times on a single machine with two CPUs when used against 50 million pieces of content; support hundreds of thousands of enterprise users, or millions of web users, accessing hundreds of terabytes or even petabytes of data; and save storage space with an overall footprint of less than 10% of the original file size.

This enhanced scalability results in hardware cost-savings as well as the ability to address larger volumes of content. Though the intelligent data operating layer scales extremely well on commodity servers, its flexible architecture can take full advantage of 1) massive parallelism and symmetric multiprocessing capabilities, 2) software platforms (such as Solaris 10, Linux 64, Win64, etc.), 3) distributed server farms, and 4) external disk arrays (e.g. NAS, SAN, etc.) to further improve its performance. This flexibility extends to being able to leverage individual or a combination of these different environments.

The intelligent data operation engine servers provide a common processing layer that allows an organization to form a conceptual understanding of information, both inside and outside the enterprise. Based on the intelligent data operating layer, the platform uses probabilistic algorithms to automatically recognize concepts and ideas expressed in all forms of information. In an embodiment, the intelligent data operation engine servers leverage NoSQL (Not Only SQL) technology, which enables enterprises to simultaneously understand and act upon electronic documents, emails, video, chat, phone calls, and application data moving across networks, the web, the Cloud, smartphones, tablets, and sensors. With over 500 out-of-the-box functions and 400 connectors, the intelligent data operating layer advanced pattern-matching technology understands the meaning of all enterprise information regardless of format, human language, location, subject or quantity and detects patterns, emotions, sentiments, intent, risks, and preferences as they happen. The connectors may extract data from content sources, import the data into an IDX format or XML file formats, and then index the data into the intelligent data operation engine server or servers. A single view into all content allows highly complex analytics to be performed seamlessly across a variety of data types, repositories, and communication channels to dramatically increase the value an organization can derive from its information.

FIG. 1 and FIG. 2 illustrate an intelligence platform for the processing of a large set of both unstructured data of all varieties, such as video, audio, social media, email, text, click streams, log files, and web-related content and search results, as well as structured data. Distributed processing of big data sets that include structured and unstructured information types across the clusters of intelligent computing engines utilizes a simple programming model. As illustrated in FIG. 2, content from various repositories of stored electronic files is aggregated by connectors 200 and then, using index processing 202, content of stored electronic files is indexed into the intelligent data operation engine server and/or communicated for dissemination across multiple intelligent data operation engine servers 206, through the distributed index handler.

In an embodiment, an intelligent data operating layer is built on adaptive pattern recognition technology and probabilistic modeling to form a conceptual and contextual understanding of the digital content and metadata associated with an electronic file, and then later the conceptual understanding of the contextual understanding is refined by extracting meaning from the manner in which people interact with that electronic file. Meaning based computing refers to the ability to form a conceptual understanding of all information, whether structured, semi-structured or unstructured, and recognize the relationships that exist within it. The conceptual understanding of information allows computers to harness the full richness of human information, bringing meaning to all data, regardless of what data structure that information comes from or what type of storage repository stores that data. Through sophisticated functionality and analytics, meaning based computing automates manual operations in real-time to offer true business value. Meaning based computing extends far beyond traditional methods such as keyword search that simply allow users to find and retrieve data. Keyword search engines, for example, cannot comprehend the meaning of information, so they only find documents in which a specific word occurs. The intelligent data operating layer assigns both mathematical weights and idea distancing positioning in different categories to give an ability to understand information that discusses the same idea (i.e. is relevant) but uses semantically different words. Idea distancing shows vital relationships between seemingly separately tagged subjects to increase findability of information. These example systems also may use statistical probability modeling that is refined over time of operation of the engine to calculate the probabilistic relationships between terms and phrases. The model of the conceptual understanding then improves over time to be a more accurate view of the conceptual meaning of a document by applying each new use of that conceptual understanding to affect the weighted values on the stored data.

One example of this theory at work is the intelligent platform's agent profile technology. Users can create agents to automatically track the latest information related to their interests, and the intelligent data operating layer determines the relevance of a document based on the model of the agent. Adaptive Probabilistic Concept Modeling (APCM) algorithms are also used to analyze, sort and cross-reference unstructured information. In a similar manner, knowledge about the documents deemed relevant by a user to an agent's profile can be used in judging the relevance of future documents.

While some other models start off with an a priori knowledge of the state of the system and apply training to it, the intelligent platform begins with a blank slate and allows incoming data to dictate the model. In true Bayesian fashion, the model mixes new information with a growing body of older content to refine and retrain the engine. Shannon's Information Theory provides a mathematical foundation for treating information as a quantifiable value in communications. The intelligent data operating layer uses Shannon's Information Theory on human languages, which contain a high degree of redundancy or nonessential content. In an embodiment, an intelligent data operating layer also uses Shannon's Information Theory to ensure that the terms with weighted values stored in the conceptual understanding of a given piece of information are the most conceptually relevant terms for that stored conceptual understanding representation of the information. It is the combination of the above two theories that enables the intelligent platform's software to determine the most important, or informative, concepts within a document. The combination of 1) idea distancing, 2) Shannon's Information Theory to pick up the most important terms to gain a conceptual understanding of a document, together with an understanding of the hierarchical structure of the analyzed document itself (i.e., content found in a title and/or summary paragraph of a hierarchy of a document is given more weight than the content found in a body of a document), and 3) Bayesian statistical modeling that refines assigned mathematical weights over time, all assist to build an accurate conceptual understanding representation of a document.
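The role of Shannon's Information Theory described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the probability of a term is estimated as the fraction of documents containing it; the function name term_information and the sample corpus are hypothetical and are not part of the described platform.

```python
import math
from collections import Counter

def term_information(documents):
    """Return the self-information, in bits, of each term across a corpus.

    Assumption for this sketch: p(term) is the fraction of documents that
    contain the term, so a rare term has a high value of -log2(p).
    """
    total = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return {term: -math.log2(count / total) for term, count in doc_freq.items()}

corpus = [
    "the bank raised the interest rate",
    "the river bank flooded",
    "the rate of the flood",
    "the merger closed",
]
info = term_information(corpus)
```

In this toy corpus a word that occurs in every document carries zero bits, while a word that occurs in only one of the four documents carries two bits, which is one way to see why redundant words contribute little to a conceptual understanding.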

The intelligent data operating layer may use the above capabilities along with having programmed intelligence to recognize over 1,000 different file formats and securely connect to over 400 repositories to provide advanced and accurate retrieval of the valuable knowledge and business intelligence that already exists within any organization.

As illustrated in FIG. 2, connectors 200 may be used. For example, organizations can democratize data types with the intelligent platform's 400+ connectors, as well as new connectors for Social Content and big data with support for Facebook®, Twitter®, Hadoop®, and others. New intelligent image and document recognition functions include document comparison for version management and signature identification, among other features, and can accommodate data extraction requirements from complex data sources such as contracts, forms, productivity tools, and spreadsheets. In addition, standard connectors and flexible APIs assist customers in moving data, creating reports, dashboards, and queries, and developing applications for improved query performance and scalability.

In some embodiments, relatively inexpensive commodity machines may be configured to run connectors 200, and a more expensive, reliable and higher performance Enterprise class platform may be configured to run the core servers.

Enterprises may have storage repositories organized by subject matter or department, for example, a database for sales, a database for finance, a database for legal, a database for news, a database for e-mails, and many more repositories of structured and unstructured data. The intelligent platform allows these databases from distinctly separate sources to work harmoniously as one shared storage of structured and unstructured data. From a single platform, companies can access and process any piece of data in any form, including unstructured data such as text, email, web, voice, or video files, regardless of its location or language. For example, language pipeline 204 might be configured to automatically recognize and understand videos, voice mails, and text documents and convert them into an understandable list of important text and phrases and images/symbols in that electronic file. The language pipeline 204 is discussed further with respect to FIG. 7.

In an embodiment, an intelligent data operation engine server consists of a scalable server that contains one or more processors to support multi-threaded processing on advanced pattern-matching technology that exploits high-performance probabilistic modeling techniques. In an embodiment, a first instance of the action handler is implemented in a distribution server configured to support and convey distributed protocol action commands to and between the two or more intelligent data operation engine servers. This assists a user in scaling the intelligence platform in a linear manner for processing large sets of both unstructured and structured data, increasing the speed with which actions are executed, and saving processing time.

The index processing pipeline contains one or more instances of the distributed index handler. The distributed index handler can efficiently split and index large quantities of data into multiple intelligent data operation engine server instances, optimizing performance by batching data, replicating index commands, and invoking dynamic load distribution. The distributed index handler can perform data-dependent operations, such as distributing the content by date, which allows for more efficient querying. The query pipeline 208 contains one or more instances of a distributed action handler. In some embodiments, the action handler is a distribution server that allows for the distribution of protocol action commands, such as Autonomy Connection Information protocol action commands, to and between the multiple intelligent data operation engine servers. This allows systems to scale in a linear manner, increasing the speed with which actions are executed and saving processing time.
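As a rough illustration of a data-dependent operation such as distributing content by date, the following sketch routes documents to engine instances by date range. The function route_by_date, the cut-off dates, and the use of plain lists as stand-ins for engine instances are all assumptions made for illustration; the actual distributed index handler is not specified at this level of detail.

```python
from datetime import date

def route_by_date(documents, boundaries):
    """Assign each (doc_id, doc_date) pair to an engine index by date range.

    boundaries is a sorted list of cut-off dates. Engine 0 holds documents
    dated before boundaries[0]; the last engine holds the newest documents.
    """
    engines = [[] for _ in range(len(boundaries) + 1)]
    for doc_id, doc_date in documents:
        # The target engine is the number of cut-offs the document has passed.
        target = sum(1 for cutoff in boundaries if doc_date >= cutoff)
        engines[target].append(doc_id)
    return engines

docs = [("a", date(2010, 5, 1)), ("b", date(2012, 1, 15)), ("c", date(2011, 7, 4))]
shards = route_by_date(docs, [date(2011, 1, 1), date(2012, 1, 1)])
```

With content partitioned this way, a query restricted to a date range only needs to be sent to the engine instances whose ranges overlap it.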

The action handler may be a distribution server that allows the user to distribute action commands, such as querying, to the two or more intelligent data operation engine servers in order to augment performance across the intelligent platform. The query pipeline and index processing pipeline cooperate with multiple copies of intelligent data operation engine servers to improve scalability and performance on big sets of data containing both structured and un-structured electronic files. The action handler propagates query actions to the two or more instances of intelligent data operation engine servers to search the index of content in the two or more repositories, which further ensures uninterrupted service in the event of server failure. The action handler uses the two or more instances of intelligent data operation engine servers as a pool of servers. A primary intelligent data operation engine server is automatically selected, and the action handler switches to a secondary intelligent data operation engine server when the primary intelligent data operation engine server fails so that service continues uninterrupted.

The user may select or configure how the intelligence platform intelligently distributes work amongst the instances of intelligent data operation engine servers. For flexibility, both the action handler and the distributed index handler can be configured by the user to run in mirroring mode (the intelligent data operation engine servers are exact copies of each other) or non-mirroring mode (each intelligent data operation engine server is configured differently and contains different data). Thus, the user is given an option of how to optimize the multiple instances of intelligent data operation engine servers to service their sets of repositories of stored electronic files and the volume of queries the system is expected to process per unit of time. In the ‘non-mirroring mode’, the system is optimized to handle a maximum expected volume of queries per unit of time. Each intelligent data operation engine server instance processes its own given query and works through the analytics of finding relevant matching electronic files by itself, independent of the other intelligent data operation engine server instances. Load balancing still occurs amongst the multiple instances, but an individual query analysis and search result response is handled by a single intelligent data operating layer instance working on that individual query analysis.

In the ‘mirroring mode’, the system is optimized to handle fewer queries, each with a shorter response time. All of the intelligent data operation engine server instances cooperate to work on different aspects and parts of an individual query analysis and the search result response for that query analysis. Thus, the intelligence platform is scalable to suit each user enterprise's needs: it can 1) handle massive amounts of queries per unit time or 2) handle fewer queries with less latency between the query and the search result response.

The distributed index handler increases performance of both the index processing pipeline and query pipeline by using database statistics and life cycle management to determine relevant electronic files and place them into the proper category for indexing, and then later, for query optimization, provides feedback while searching different repositories to determine the most relevant electronic files. This determination may occur by factoring in the number of relevant documents returned for the key terms of the query, the strength or percentage relevance of those returned documents to the query, and tracked historic data of the most relevant indexed categories to search for previous similar queries.

The action handler of the query pipeline analyzes the nature of a query when possible. For example, a query in the financial database of an enterprise is most likely seeking financial related structured and unstructured data as search results, and accordingly documents indexed in the financial category are searched first. Likewise, for a query about news items, documents indexed in the current news category are searched first, and lifecycle management tends to move documents with older dates out of that category and into a historical news category. Thus, the action handler of the query pipeline is configured to analyze the nature of the content of a query when possible, and the query pipeline increases performance by determining the rarest terms or noun phrases in the query search terms.

The query pipeline analyzes the statistically rarest terms or noun phrases present in the query search terms such that the nature of the query can determine the most relevant sub-portions of the common indexed structured data and unstructured data in which to begin the search. The query pipeline then sends command actions to two or more intelligent data operation engine servers to focus the greatest amount of processing power on analyzing the electronic files in the storage repositories containing these relevant sub-portions. The query pipeline focuses the majority of the total processing power of the distributed two or more intelligent data operation engine servers on finding the electronic files most relevant to the rarest terms, noun phrases, and/or term pairings in the query search terms. The more common search terms tend to bring back a wider swath of documents that are not relevant but still need to be analyzed. However, starting the search response on the indexed structured data and unstructured data by weighing heavily on the rarest terms, noun phrases, and/or term pairings tends to rapidly narrow the volume of potentially relevant documents/electronic files that need to be analyzed to determine a mathematical number of their relevance in relation to the query, so that a ranked list of relevant structured and unstructured data can be presented back as search results.
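The effect of starting a search from the rarest query terms can be sketched as follows. This is a simplified illustration that assumes an in-memory inverted index (a mapping from term to the set of document identifiers containing it) and uses document frequency as the rarity measure; the function names and sample data are hypothetical.

```python
def order_terms_by_rarity(query_terms, doc_freq):
    """Sort query terms so the rarest (lowest document frequency) come first.

    Terms absent from doc_freq are treated as frequency 0 and sort first.
    """
    return sorted(query_terms, key=lambda t: doc_freq.get(t, 0))

def candidate_documents(query_terms, postings, doc_freq, seed_terms=1):
    """Seed the candidate set from the rarest term(s) only."""
    ordered = order_terms_by_rarity(query_terms, doc_freq)
    candidates = set()
    for term in ordered[:seed_terms]:
        candidates |= postings.get(term, set())
    return ordered, candidates

# Hypothetical inverted index: term -> ids of documents containing it.
postings = {
    "report": {1, 2, 3, 4, 5},
    "annual": {2, 3, 5},
    "derivatives": {3},
}
doc_freq = {term: len(ids) for term, ids in postings.items()}
ordered, candidates = candidate_documents(
    ["annual", "report", "derivatives"], postings, doc_freq
)
```

Seeding from the rare term leaves a single candidate document to score, whereas starting from the common term "report" would have pulled in all five.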

The intelligence platform keeps each electronic file in its entirety but inserts statistical weights for relevance to the conceptual understanding of that document into the tiers of term level, corpus level, and document level, and then places the representation of the understanding of that electronic file into the appropriate category. The intelligence platform uses feedback when searching for matching content between structured and unstructured documents. In addition, the Distributed Service Handler (DiSH) component allows effective auditing, monitoring and alerting of all of the other components in this distributed platform. In some embodiments, the DiSH may be used as a single point of control, monitoring and configuration for all of the components in this distributed platform.

In addition, the integration of the indexes for structured and unstructured data has been performed in a way that ensures maximum scalability. Techniques for storing structured data are well established as part of database systems, with more recent developments towards column-based storage allowing increasingly rapid evaluation of certain types of query. Rather than allowing separate “databases” to index the structured and unstructured data, the intelligent data operation engine server has been designed to handle and store both types of data (i.e., structured and unstructured data) in a single system. This utilizes existing advances in structured data storage with the intelligent data operation engine server instances' statistical database technology, and combines them to allow an immediacy of interaction that is able to optimize queries that are evaluated against a corpus containing both structured and unstructured data simultaneously.

In the illustrated embodiment, all of the intelligent data operation engine server instances use a common protocol, such as Autonomy Connect Information (ACI) protocol, to talk to each other and present a common Application Programming Interface (API) 210.

In various embodiments of the systems and methods described herein, performance and capacity can be essentially doubled by replicating the existing machine and coordinating the efforts of the machines. This allows scaling predictions to be made without worrying about bottlenecks.

Some embodiments deliver linear scalability by use of a distribution model, which allows additional machines, locations and indexes to appear as one. In addition, an intelligent data operating layer may include distributed components that are uniquely ‘Geo-efficient’. Geoefficiency permits completely fault tolerant national, trans-national and trans-global architectures to be assembled with ultimate flexibility in component placement. The intelligent data operating layer's distributed components, the distributed index handler and the action handler, form a coherent layer within the intelligent data operating layer. The distributed index handler and action handler components can be placed inline or within fully nested topologies. Heterogeneous hardware/OS and network environments are fully supported, with both the action handler and distributed index handler able to act in isolation or cooperatively to intelligently process index and query traffic. The action handler and distributed index handler, cooperating, support distributed index and query commands as well as data payloads that can automatically be scheduled, mirrored, throttled, queued, and recovered.

These systems and methods may support mirroring and failover processes. For example, the action handler uses the two or more instances of intelligent data operation engine servers as a pool of servers. A primary intelligent data operation engine server is automatically selected, and the action handler switches to a secondary intelligent data operation engine server when the primary intelligent data operation engine server fails so that service continues uninterrupted.
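The failover behavior described above can be sketched in a few lines. In this illustration the servers are modeled as plain callables and a failed server is modeled by a raised ConnectionError; the class name ActionHandlerSketch and both of these modeling choices are assumptions, not the platform's actual interface.

```python
class ActionHandlerSketch:
    """Hypothetical stand-in for an action handler with a server pool."""

    def __init__(self, servers):
        # Ordered pool: the first entry is treated as the primary.
        self.servers = list(servers)

    def query(self, text):
        last_error = None
        for server in self.servers:
            try:
                return server(text)
            except ConnectionError as err:
                # This server is unavailable; fall through to the next one.
                last_error = err
        raise RuntimeError("no server available") from last_error

def failing_primary(text):
    raise ConnectionError("primary down")

def healthy_secondary(text):
    return f"secondary handled: {text}"

handler = ActionHandlerSketch([failing_primary, healthy_secondary])
result = handler.query("quarterly revenue")
```

The caller issues one query and receives one answer; the switch from the primary to the secondary is invisible at the call site, which is the sense in which service continues uninterrupted.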

The intelligent data operating layer's cooperation with the distributed index handler allows companies to cost-effectively outsource the storage and management of emails, electronic documents, rich media files, instant messages, and all forms of web content. This cooperation allows operation across, for example, 25,000 production servers hosted in data centers located around the world. It also provides security and scalability in the cloud, adhering to global certification standards such as SAS 70 Type II, PCI DSS, US DOD 5015.02, UK TNA2002, and Australia's VERS. Two or more fully mirrored, geographically separate systems can provide complete data and system redundancy as well as parallel processing of all tasks.

The intelligent data operating layer's cooperation with the distributed action handler provides a high degree of reliability through fail-over mechanisms built into the distribution components. The action handler allows, for example, 100% uptime from a set of two or more intelligent data operation engine servers, while the distributed index handler ensures data integrity across that set of two or more intelligent data operation engine servers.

The Distributed Service Handler (DiSH) component is configured to support high availability deployments, so that administrators are alerted to potential faults or when maintenance may be required. The Distributed Service Handler (DiSH) component allows effective auditing, monitoring and alerting of all other intelligent platform components. The Distributed Service Handler can be used to alert to critical errors, sizing boundaries or extraordinary events, thereby automatically keeping administrators aware when there are problems, when the limits of the current system are close to being reached, and when unexpected events occur.

The illustrated server provides for instruction-level parallelism. An intelligent data operating layer server programmatically expresses itself as an expanding collection of operations. These operations can be, and are, executed in serial pipeline form, yet the inherent logic of simultaneously processing disparate forms of unstructured, semi-structured and structured data requires a high degree of parallelism. Not only does the intelligent data operating layer need to ingest multiple streams and types of data, the intelligent data operating layer must also provide a real-time answer or decision against that data as it is indexed rather than force the user to wait an arbitrary period until serially accessed resources become available. As a consequence, the intelligent data operating layer has been designed with instruction-level parallelism (ILP) as the core of its process and operation model. ILP by definition is limited by the serial instruction model of scalar processors, and thus the intelligent platform uses forms of parallel architecture including multi-CPU, hyper-threading and now single die multi-core processing.

The intelligent data operating layer engine's default process model is multi-threaded (using a configurable number of threads). The intelligent data operating layer operations can either be grouped by class, with indexing and querying performed by separate threads, or, for n-core models, a single operation can be “atomized” into multiple threads. Concurrent querying and indexing is the default, with no requirement whatsoever for “locking” any part of the indexes while querying takes place. All major multi-core manufacturers are supported, including, for example, Intel, AMD and Niagara offerings from Sun Microsystems.

The intelligent platform may use multi-core strategies as a key to crossing the consumer “teraflop” threshold. The intelligent platform may use “coalition” simulations of split thread intelligent data operating layer operations against n-core “battalion” processor units that blend general-purpose cores with more specialist cores such as those dedicated to signal processing. These blended core units in the engine may be teraflop chips. The intelligent platform may use thread models that dynamically co-opt different core types to act in “coalition” to perform the simultaneous deconstruction and analysis of unstructured sources such as video that combine visual and auditory attributes.

The system illustrated in FIG. 1 can be scaled for increased performance. An embodiment scales to support the largest enterprise-wide and portal deployments in the world, with presence in virtually every vertical market. Because the scalability of these systems and methods is based on a modular architecture, an embodiment can handle massive amounts of data on commodity dual-CPU servers. An embodiment of the intelligent data operating layer delivers linear scalability through a multi-threaded, multi-instance approach with load-balancing to intelligently distribute the indexing and query workload.

Intelligent Data Operating Layer (IDOL) Server

FIG. 3 illustrates an example block diagram of some modules of an intelligent data operation engine server, in accordance with some embodiments. Intelligent data operation engine server 300 may include automatic hyperlinking module 305, automatic categorization module 310, automatic query guidance module 315, automatic taxonomy generation module 320, profiling module 325, automatic clustering module 330, and conceptual retrieval module 335. The automatic hyperlinking module 305 is configured to allow manual and fully automatic linking between related pieces of information. The hyperlinks are generated in real-time at the moment the document is viewed. The automatic categorization module 310 is configured to allow deriving precise categories through concepts found within unstructured text, ensuring that all data is classified in the correct context.

The automatic query guidance module 315 is configured to provide query suggestions to find the most relevant information. The automatic query guidance module 315 identifies the different meanings of a term by dynamically clustering the results into their most relevant groupings. The automatic taxonomy generation module 320 is configured to automatically generate taxonomies, such as an XML schema, and instantly organize the data into a familiar child/parent taxonomical structure. The automatic taxonomy generation module 320 identifies names and creates each node based on an understanding of the concepts within the data set as a whole. The profiling module 325 is configured to accurately understand individuals' interests based on their browsing, content consumption and content contribution. The profiling module 325 generates a multifaceted conceptual profile of each user based on both explicit and implicit profiles.

The automatic clustering module 330 is configured to help analyze large sets of documents and user profiles and automatically identify inherent themes or information clusters. The automatic clustering module 330 even clusters unstructured content exchanged in emails, telephone conversations and instant messages. The conceptual retrieval module 335 is configured to recognize patterns using a scalable technology that recognizes concepts and finds information based on words that may not be located in the documents. It should be noted that the intelligent data operation engine server 300 may also include other modules and features that enable it to work with enterprise facilities.

FIG. 4 illustrates an example block diagram of an embodiment of an intelligent data operation engine server generating a conceptual representation of an electronic file. The creation of a conceptual representation of an electronic file not only extracts key text and metadata to assist in categorization of, taxonomy generation for, and matching of relevant queries to the electronic file, but also preserves and intelligently processes all content, intelligence and metadata relationships that reside within the electronic file. The information powering today's organizations exists in two forms: structured and unstructured. The structured data requires humans to adapt the information to fit a format needed by machines. Manual classification augments structured information with the nuances and complexity that computers could not grasp, because people do not speak in zeroes and ones. The nature of human communication is complex, using language and idioms, photographs and videos, recordings and social media interactions. Human information is unstructured and does not fit into the neat rows and columns of relational databases. Yet, unstructured content from the web and other sources represents the fastest growing segment of the world's content. An intelligent data operation engine server allows organizations to eliminate the inefficiencies associated with manually creating XML tags by understanding the content and purpose of either the tag itself, related information, or both.

The index processing pipeline develops a conceptual understanding of structured and unstructured electronic files and assigns statistical weights to the terms, noun phrases, etc. making up the electronic file, indicating the relevance of those terms, noun phrases, etc. to the overall understanding of that electronic file. Those weights on a term level, corpus level, and document level are then associated with that electronic file as well as indexed into one or more categories that the electronic file primarily falls into, for example, categories determined by subject matter, categories determined by age, and categories determined by the human language the electronic file is spoken and/or written in. Electronic document files can be text files, e-mails, electronic files, video files, audio files, instant messages, etc.

In an embodiment, the intelligent data operating layer natively indexes all documents directly as XML in the engine and inserts XML tags into the conceptual understanding representation of that document while maintaining any original XML tags. This allows interoperability between applications that use different XML tagging rules because the original XML tags are still there while the inserted XML tags help to connect the dots between any two XML tag structures. With the document's original XML tags and the natively indexed tags inserted, all documents of all types, structured and unstructured, can be stored in a single type of database. Moreover, because the common XML tags have been inserted into the conceptual understanding of all types of electronic files, companies can, from a single platform, access and process any piece of data in any form, including unstructured data such as text, email, web, voice, or video files, regardless of its location, the format of the unstructured document, or human language.

For example, block 402 is an XML parser. An XML electronic file 416 can be an input to the XML parser 402. Alternatively, a non-XML electronic file 418 can be converted to XML 414 and provide an input to XML parser 402. The XML parser can then parse the converted electronic file (which is XML) or the XML electronic file 416 for input to the intelligent layer 406.
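The two input paths to the XML parser can be sketched with Python's standard xml.etree.ElementTree module. The element names document, title and body, and the helper names to_xml and parse_fields, are hypothetical choices for this illustration and are not the tag structure used by the described system.

```python
import xml.etree.ElementTree as ET

def to_xml(title, body):
    """Convert a non-XML file (here, two plain strings) into XML text."""
    doc = ET.Element("document")
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "body").text = body
    return ET.tostring(doc, encoding="unicode")

def parse_fields(xml_text):
    """Parse XML text and return a mapping of top-level field name to text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

xml_text = to_xml("Q3 results", "Revenue rose in the third quarter.")
fields = parse_fields(xml_text)
```

A file that is already XML would skip to_xml and go straight to parse_fields, mirroring the direct path from XML electronic file 416 to the parser.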

The embodiment illustrated in FIG. 4 includes a hierarchal map module of the document 404 that generates a hierarchal map of each electronic file. Generally, the module 404 outlines the electronic file to capture a conceptual understanding of the document. For example, the intelligent layer can determine various concepts 422, 426 from the content map 408, which can include XML fields for various content of the document such as “A” content description 420, “B” content description 424, etc. These contents 420, 424 can then provide the concepts 422, 426 for the hierarchal map. Tag reconciler 410 can reconcile tags for the content descriptions between the content map module 408 and the hierarchal map module of the document. The content map module 408 also tries to place similar concepts close to each other to take advantage of idea distancing. In this example way, the intelligent data operation engine server can generate a conceptual representation of an electronic file.

The classification portion identifies concepts in the data in an electronic file, or in a series of electronic files, and uses them to build clusters of related information. Taxonomy generation builds a hierarchical structure, from these clusters and/or from the results of a query to the IDOL engine over time, which aids in the directory structure and category hierarchy. The category hierarchy contains categories that the classification portion builds from concepts identified by a user and/or imported by taxonomy generation.

In the illustrated embodiment, query handler 412 can receive user queries and provide input to the intelligent layer 406. In this way, a user can perform a search using the systems and methods described herein. Based on the search results, the intelligent layer 406 can provide XML output document 428 to the user in response to the query received by query handler 412.

More details with respect to the intelligent layer 406 are discussed with respect to FIGS. 5A and 5B. The intelligent layer 406 can take advantage of instruction-level parallelism and multi-threaded processing.

Instruction-Level Parallelism

An example system may programmatically express itself as an expanding collection of operations. These operations can be, and are, executed in serial pipeline form, yet the inherent logic of simultaneously processing disparate forms of unstructured, semi-structured and structured data may be amenable to a high degree of parallelism. Not only is an example system capable of ingesting multiple streams and types of data, it may also provide a real-time answer or decision against that data as it is indexed rather than force the user to wait an arbitrary period until serially accessed resources become available.

As a consequence, the intelligent data operating layer may be designed with instruction-level parallelism (ILP) as part of its process and operation model. ILP by definition is limited by the serial instruction model of scalar processors; and thus, the intelligent platform may use all forms of parallel architecture from multi-CPU, hyper-threading, and/or single die multi-core processing.

The engine's default process model may be multi-threaded (using a configurable number of threads). An example system can include operations that can either be grouped by class, with indexing and querying performed by separate threads, or, for n-core models, a single operation can be “atomized” into multiple threads. Concurrent querying and indexing is the default, with no requirement whatsoever for “locking” any part of the indexes while querying takes place. The servers use many multi-core hardware parts and multiple threaded techniques.

FIGS. 5A-5B illustrate an example flow diagram of a process to create one or more conceptual understandings/representations of each electronic file. In step 500, the example method creates one or more conceptual-understanding representations of each stored electronic file by the following steps. The index processing pipeline develops a conceptual understanding of both structured data and unstructured data electronic files.

In step 505, inconsequential information from the supplied content is eliminated. For example, information not used for indexing or for the conceptual understanding of structured and unstructured electronic files may be eliminated. Typically, each human language has many very common words that add little to conceptual understanding of the document such as ‘the’, ‘a’, ‘an’, ‘and’, many verbs, etc. A filter containing this list of words may eliminate these inconsequential words from the content of the electronic file.
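The filtering in step 505 can be sketched as a simple word-list filter. The stop-word list below is a small hypothetical sample, and whitespace tokenization is an assumption made to keep the example short.

```python
# Hypothetical sample of very common English words; a real list would be
# longer and would differ for each human language.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def remove_inconsequential(text, stop_words=STOP_WORDS):
    """Lower-case the text, split on whitespace, and drop listed words."""
    return [w for w in text.lower().split() if w not in stop_words]

tokens = remove_inconsequential("The merger of a bank and an insurer")
```

Only the content-bearing words remain as input to the key-term determination that follows.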

In step 510, a set of key terms is determined. The set of terms may include singular terms, higher order terms, noun phrases, or proper names. For example, when the index processing pipeline develops a conceptual understanding of structured and unstructured electronic files, it can assign statistical weights to the terms, noun phrases, etc. making up the electronic file, indicating the relevance of those terms, noun phrases, etc. to the overall understanding of that electronic file. Those weights on a term level, corpus level, and document level are then associated with that electronic file as well as indexed into one or more categories that the electronic file primarily falls into, for example, categories determined by subject matter, categories determined by age, or categories determined by the human language the electronic file is spoken and/or written in. Electronic files can be text files, e-mails, video files, audio files, instant messages, etc. A number of factors affect an assigned weight, such as the number of times a word occurs in the electronic file and the word's position if a hierarchical structure exists in the electronic file. For example, words in a title and/or abstract paragraph are assigned a higher weight than those merely found in the body of the document.

In step 512, a frequency of occurrence weight is assigned to each main term and higher order combination of terms in each sentence of the document. Bayesian theories are applied, and one or more weighted values are then associated with each term. An adaptive concept modeling weight is applied based on the hierarchical structure of the document.
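As a rough illustration of how occurrence counts and structural position could combine into a weight, the following sketch boosts terms found in a title or abstract over terms found only in the body. The boost values, the field names, and the function `weight_terms` are assumptions; the sketch does not attempt the Bayesian or adaptive concept modeling steps named above.

```python
from collections import Counter

# Assumed boost factor per structural field; the values are illustrative only.
FIELD_BOOST = {"title": 3.0, "abstract": 2.0, "body": 1.0}

def weight_terms(fields):
    """fields maps a field name to a list of already-filtered terms.

    Each occurrence adds the boost of the field it appears in, and the
    resulting scores are normalized so that they sum to 1.0.
    """
    raw = Counter()
    for name, terms in fields.items():
        boost = FIELD_BOOST.get(name, 1.0)
        for term in terms:
            raw[term] += boost
    total = sum(raw.values())
    if not total:
        return {}
    return {term: score / total for term, score in raw.items()}

weights = weight_terms({"title": ["lion"], "body": ["lion", "zoo"]})
print(weights)
```

In this example "lion" scores 3.0 from the title plus 1.0 from the body, against 1.0 for "zoo", so "lion" carries four fifths of the total weight.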

In step 560, a mathematical indication of whether the content relates to the category is determined. This can be based on the key terms (step 510) and the frequency of occurrence weight assigned (step 512). Further, this is refined when people's search query terms are analyzed each time this document is selected by a user. The electronic file is moved closer to categories corresponding to the search query terms of people who have selected that electronic file as a document relevant to the search query.

In step 570, the one or more conceptual representations of the content in the electronic file are stored. For example, the conceptual representation may be stored on a server such as those illustrated in FIG. 1.

In step 580, the assigned weighted values, as modified over time as the conceptual representation is used and matched, are stored. For example, an embodiment can assign both mathematical weights and positioning in different categories to give an ability to understand information that discusses the same idea (i.e., is relevant) but uses semantically different words.

In step 590, the example method correlates to a semantically similar representation. An example embodiment can use idea distancing (vital relationship between seemingly separately tagged subjects) to increase findability of information. Similar subject matters are placed close to each other in the logical space of the relational database. All animal types, such as lions, tigers, and bears, will be placed close to each other under the umbrella of animal. Although lions, tigers, and bears are different sub-categories, they are closely related in idea distance.

In step 595, the refined conceptual representation over time is stored. For example, an embodiment can use Information Theory to ensure the terms having weighted values are the most conceptually relevant terms for that stored conceptual understanding representation of the document. The process ends at end block 599.

FIG. 6 illustrates an example flow diagram of a process in the query pipeline to match a query to the existing document representations.

In step 601, the example method eliminates the inconsequential information from the supplied query content and creates a conceptual representation of the supplied query content. After step 601, the example method may perform adaptive probabilistic concept caching. In adaptive probabilistic concept caching, frequently used concepts are maintained in memory and query results are returned as quickly and efficiently as possible. Two of the key factors in any deployment are query and index performance. The intelligent platform's Adaptive Probabilistic Concept Caching algorithm ensures that frequently used concepts are maintained in memory caches and that query results are returned as quickly and efficiently as possible. The intelligent platform also uses multi-tier caching, ensuring that the minimum number of operations is performed to provide the functionality required. Intelligent Advanced Probabilistic Conceptual Multi-Tier (APCMT) caching is used in multiple parts of the information processing pipeline to ensure the most efficient response is given from the most efficient component as quickly as possible. This also ensures that the individual pieces of information that can be cached are cached, and that information that is time critical and cannot be cached can be excluded from the scheme.
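The general idea of keeping frequently used concepts in memory in front of a slower component, while excluding time-critical items from caching, can be sketched as follows. This is a plain least-recently-used cache standing in for the caching scheme described above, whose actual algorithm is not specified here; the class name `ConceptCache`, the `cacheable` flag, and the capacity are assumptions.

```python
from collections import OrderedDict

class ConceptCache:
    """Small in-memory LRU tier in front of a slower lookup function."""

    def __init__(self, lookup, capacity=2):
        self.lookup = lookup          # slower tier, e.g. a search of the index
        self.capacity = capacity
        self.entries = OrderedDict()  # concept -> cached result
        self.hits = 0
        self.misses = 0

    def get(self, concept, cacheable=True):
        # Time-critical requests pass cacheable=False and bypass the cache.
        if cacheable and concept in self.entries:
            self.entries.move_to_end(concept)     # mark as most recently used
            self.hits += 1
            return self.entries[concept]
        self.misses += 1
        result = self.lookup(concept)
        if cacheable:
            self.entries[concept] = result
            self.entries.move_to_end(concept)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return result

cache = ConceptCache(lambda concept: concept.upper())
cache.get("lion")
cache.get("lion")
print(cache.hits, cache.misses)
```

The second request for the same concept is served from memory, so only one call reaches the slower tier.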

In step 603, the example method identifies and correlates the main terms of conceptual representations common to the content in both the query input and the fields in the stored XML document representation. Both the document's original XML tags and the natively indexed tags inserted in all documents of all types, structured and unstructured, can be stored in a single type of database, which helps to correlate main terms.

In step 604, the example method determines the probability of those common main terms occurring together in a given sentence. For example, an embodiment can use statistical probability modeling to calculate the probabilistic relationships between terms and phrases. The model of the conceptual understanding then improves over time to give a more accurate view of the conceptual meaning of a document by applying each new use of that conceptual understanding to affect the weighted values on the stored data.
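One simple way to estimate how likely two terms are to occur together in a sentence is to count the sentences that contain both and divide by the total number of sentences. The sketch below takes that approach; it is an assumption made for illustration and is not presented as the statistical model the embodiment uses. The function name `cooccurrence_probability` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_probability(sentences):
    """sentences is a list of term lists, one list per sentence.

    Returns a dict mapping each alphabetically ordered term pair to the
    fraction of sentences in which both terms appear.
    """
    pair_counts = Counter()
    for terms in sentences:
        # set() counts a pair at most once per sentence.
        for pair in combinations(sorted(set(terms)), 2):
            pair_counts[pair] += 1
    n = len(sentences)
    if n == 0:
        return {}
    return {pair: count / n for pair, count in pair_counts.items()}

probs = cooccurrence_probability([
    ["lion", "zoo"],
    ["lion", "tiger"],
    ["lion", "zoo", "tiger"],
])
print(probs[("lion", "zoo")])
```

Here "lion" and "zoo" appear together in two of the three sentences, while "tiger" and "zoo" appear together in only one.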

In step 606, the example method selects sentences that contain the largest number of semantically similar terms shared by both the query representation and the field representation.
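The selection in step 606 can be approximated with a short sketch that ranks sentences by how many terms they share with the query. Exact term overlap is used here as a stand-in for semantic similarity, which is an assumption; the function name `best_sentences` and the `top_n` parameter are hypothetical.

```python
def best_sentences(query_terms, sentences, top_n=1):
    """Return up to top_n sentences sharing the most terms with the query.

    sentences is a list of term lists. Ties keep their original order, and
    sentences that share no term with the query are never returned.
    """
    query = set(query_terms)
    scored = [(len(query & set(terms)), i) for i, terms in enumerate(sentences)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[i] for score, i in scored[:top_n] if score > 0]

print(best_sentences(
    ["lion", "zoo"],
    [["tiger", "forest"], ["lion", "zoo", "keeper"], ["lion", "cub"]],
    top_n=2,
))
# -> [['lion', 'zoo', 'keeper'], ['lion', 'cub']]
```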

In step 610, the example method chooses the one or more representations with similar content from the stored representations of the XML documents and assigns a relational probability of relatedness to the content in the query input.

FIG. 7 illustrates an example block diagram of an embodiment of an intelligent data operation engine with a speech recognition engine portion.

As illustrated in the example block diagram, an audio information stream 702 can be input to the example system, for example, by a person 750 talking on a mobile electronic device, a podcast on the Internet, radio and television shows, etc. or some other form of unstructured data.

The audio information stream input 702 can be identified using language identification engine 744 and speech recognition models 706. The speech recognition models 706 can include filters to differentiate between, for example, U.S. English 711, U.K. English 710, Colombian Spanish 712, and European Spanish 709, to name a few examples. Additionally, speech recognition models 706 can be used to determine if the input 702 is audio sound 714 and for speaker recognition 728 to allow for further audio information stream input 702 processing. The systems and methods described herein are not limited to audio information. Rather, these systems and methods may be used in conjunction with both unstructured data of all varieties, such as video, audio, social media, email, text, click streams, log files, and web-related content and search results, as well as structured data.

Speech recognition models 706 can be controlled by index control 704 to output XML based on the audio information stream input 702. This XML output can then be processed using the systems and methods described herein. For example, the XML output can be directed to a storage device such as relational databases and alternative databases 716. The XML output can also be directed to the intelligence engine 720 for processing in accordance with the systems and methods described herein.

An example information platform can include a single processing layer that enables organizations to extract meaning and act on all forms of information, including audio, video, social media, email and web content, as well as structured data such as customer transaction logs and machine-based sensor data. The platform combines the infrastructure software of the intelligence engine 720 for automatically processing and understanding unstructured data with the high-performance, real-time analytics engine for extreme structured data. A single processing layer provides for conceptual, contextual, real-time understanding of all data, inside and outside an enterprise.

The intelligence engine 720 uses pattern matching powered by statistical algorithms to assign weights to a set of term, corpus, and document level tiers. It then indexes these conceptual understandings/representations of an electronic file and forms clusters of all of the electronic files that convey a similar concept, based on the formed conceptual understanding of the electronic file fitting into a particular category, so that the intelligence engine 720 can also recognize distance in ideas and concepts, and does this in near real time. Manage-in-Place technology indexes all data where it resides, eliminating copying requirements, storage costs, and hand-off risks, for example, by interfacing with databases 718. A NoSQL interface provides a single processing layer for cross-channel analytics of structured and unstructured data. The intelligence engine 720 uses performance enhancements for the Analytics Platform including sub-queries, database statistics, life cycle management to determine relevant electronic files, query optimization, data re-segmentation, and join filtering.

The systems and methods described herein can include a language pipeline, configured to automatically recognize and understand videos, voice mails, and text documents and convert them into an understandable list of important text, phrases and images/symbols in that electronic file.

All of the intelligence engines 720 (intelligent data operation engine server instances) use a common protocol to talk to each other, such as Autonomy Connect Information Application Programming Interface.

As illustrated in FIG. 7, an example system may process queries provided by user 750 to an example Windows specific system 724 through query control user interface 730. The queries can be directed through C.O.M. query handler 732 and the platform independent HTTP user API to the intelligence engine 720 and to the relational databases and alternative databases 716. Based on the query, a response may be routed from database 718, intelligence engine 720 or relational databases and alternative databases 716 through the query control user interface 730 back to user 750.

The system illustrated in FIG. 7 can include mapped security. Generally, the biggest single constraint on scalability within enterprise applications can be the ability to manage entitlement checks in a scalable manner. An example intelligent data operating layer stores security information in its native form directly in the kernel of the engine itself, with automatic updates to keep the security data current. This sharply contrasts with other security models that store security information in the original repositories, requiring communication between the search engine and the underlying repository for every potential result at the time of query.
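The contrast drawn above can be illustrated with a minimal sketch in which each index entry carries a copy of its access-control information, so a query is filtered using index data alone and makes no call to the source repository. The `IndexedDoc` type, the group-based permission model, and the `search` function are assumptions made for illustration; how security data is actually held inside the engine is not detailed here.

```python
from dataclasses import dataclass, field

@dataclass
class IndexedDoc:
    doc_id: str
    terms: set
    # Copy of the repository's permissions, held alongside the index entry.
    allowed_groups: set = field(default_factory=set)

def search(index, term, user_groups):
    """Return ids of matching documents the user is entitled to see.

    Entitlement is checked against data stored with each index entry, so
    no per-result round trip to the underlying repository is needed.
    """
    groups = set(user_groups)
    return [
        doc.doc_id
        for doc in index
        if term in doc.terms and doc.allowed_groups & groups
    ]

index = [
    IndexedDoc("d1", {"lion", "zoo"}, {"sales"}),
    IndexedDoc("d2", {"lion"}, {"legal"}),
]
print(search(index, "lion", ["sales"]))
# -> ['d1']
```

Keeping such copies current would require the automatic updates the paragraph above mentions, which this sketch does not model.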

An embodiment allows scaling without impeding performance, including searching each document in its entirety. This allows users to retrieve valuable information from every part of the document/video file/audio file/database.

In an embodiment, scaling can be performed by indexing documents such that they may be searched and located and then queuing appropriate material based on searching the index. The structured data and unstructured data in the different storage repositories are indexed and organized within that single common index. Each conceptual understanding/representation of an electronic file has pointers to the actual stored structured data and unstructured data.
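A single index whose entries point back to files held in different repositories can be sketched as follows. The `Pointer` and `CommonIndex` names, the repository labels, and the term-to-pointer layout are hypothetical and chosen only to illustrate the pointer relationship described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pointer:
    repository: str  # hypothetical label, e.g. "crm_db" or "file_share"
    location: str    # row key, path, or similar locator in that repository

class CommonIndex:
    """One index over files that remain stored in separate repositories."""

    def __init__(self):
        self.postings = {}  # term -> set of Pointer

    def add(self, pointer, terms):
        for term in terms:
            self.postings.setdefault(term, set()).add(pointer)

    def find(self, term):
        found = self.postings.get(term, set())
        return sorted(found, key=lambda p: (p.repository, p.location))

index = CommonIndex()
index.add(Pointer("crm_db", "orders/42"), ["lion", "invoice"])
index.add(Pointer("file_share", "/video/zoo.mp4"), ["lion", "zoo"])
print([p.repository for p in index.find("lion")])
# -> ['crm_db', 'file_share']
```

A query for one term returns pointers into both a structured and an unstructured repository, and the files themselves stay where they were stored.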

The systems and methods described can provide for conceptual retrieval. Built on an innovative pattern-recognition technology, the intelligent data operating layer offers higher degrees of accuracy and sophistication using scalable technology that recognizes concepts rather than simply relying on words in the document.

From a single platform, companies can access and process any piece of data in any form, including unstructured data such as text, email, web, voice, or video files, regardless of its location or language.

Some portions of the detailed descriptions herein are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. These algorithms may be written in a number of different software programming languages such as C, C++, Java, or other similar languages. Also, an algorithm may be implemented with lines of code in software, configured logic gates in software, or a combination of both. In an embodiment, the logic consists of electronic circuits that follow the rules of Boolean Logic, software that contains patterns of instructions, or any combination of both.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers, or other such information storage, transmission or display devices.

The present disclosure also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled with a computer system bus. Portions of any modules or components described herein may be implemented in lines of code in software, configured electronic circuits, or a combination of both, and the portions implemented in software are tangibly stored on a non-transitory computer readable medium, which stores instructions in a format executable by a processor.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method blocks. The required structure for a variety of these systems will appear from the description below.

Although embodiments of this disclosure have been fully described with reference to the accompanying drawings, it is to be noted that various changes and modifications will become apparent to those skilled in the art. Such changes and modifications are to be understood as being included within the scope of embodiments of this disclosure as defined by the appended claims. For example, specific examples are provided for shapes and materials; however, embodiments include those variations obvious to a person skilled in the art, such as changing a shape or combining materials together. Further, while some specific embodiments of the disclosure have been shown, the disclosure is not to be limited to these embodiments. For example, several specific modules have been shown. Each module performs a few specific functions. However, all of these functions could be grouped into one module or even broken down further into scores of modules. Most functions performed by electronic hardware components may be duplicated by software emulation and vice versa. A processor may be a central processing unit, a multiple core and multiple threaded processor, a digital signal processor, or other similar component configured to interpret and execute instructions. The disclosure is to be understood as not limited by the specific embodiments described herein, but only by the scope of the appended claims.

What is claimed is:
 1. An intelligence platform for processing big data sets of both unstructured data and structured data, comprising: a set of distributed servers configured for distributed processing of data sets including both structured and unstructured data types across two or more intelligent data operation engine servers where a first instance of an intelligent data operation engine server is configured to use statistical algorithms to assign weights to 1) terms found in a content of an electronic file, 2) a corpus of the electronic file, 3) a document level tier of the electronic file, and 4) any combination of the three, where the first instance of the intelligent data operation engine server is also configured to use idea distancing between similar concepts stored in a relational database, where the first instance of the intelligent data operation engine server is configured to use both the statistical algorithms and idea distancing in order to form a conceptual understanding of content in each electronic file, and then the first instance of the intelligent data operation engine server is configured to cooperate with a distributed index handler to index the conceptual understanding of the electronic file, where a common index for an aggregate amount of electronic files in the big data sets includes at least a conceptual understanding of a first electronic file that contains structured data as well as a conceptual understanding of a second electronic file that contains unstructured data, and the distributed index handler is configured to communicate the conceptual understandings of the electronic files across the two or more intelligent data operation engine servers; two or more storage repositories to store the aggregate amount of electronic files in the big data sets; and a query pipeline that includes one or more instances of an action handler that is configured to support a distribution of command actions in a common protocol to the two or more intelligent data operation engine servers, where the query pipeline and the distributed index handler are configured to cooperate with the two or more intelligent data operation engine servers.
 2. The intelligence platform of claim 1, wherein a first instance of the action handler is implemented in a distribution server configured to support and convey distribution of the command actions in the common protocol both to and between the two or more intelligent data operation engine servers, which assists in scaling the intelligence platform for the processing the big data sets of unstructured data and structured data in a linear manner and increasing a speed with which command actions are executed by a combined processing power of the two or more intelligent data operation engine servers.
 3. The intelligence platform of claim 1, wherein the first instance of the intelligent data operation engine server uses both the statistical algorithms to assign weights to all three of the terms, the corpus, and the document level tiers as well as the idea distancing process to form clusters of all of the electronic files that convey a similar concept based on the formed conceptual understanding of the electronic file fitting into a similar category so the intelligence data operation engine can also recognize logical distance in ideas and concepts and does both of these operations for each conceptual understanding that is created.
 4. The intelligence platform of claim 1, wherein the distributed index handler is further configured to employ a life-cycle management algorithm to perform data-dependent operations that search the electronic files in the two or more storage repositories by a most recent date, which allows for more efficient querying; and where an index processing pipeline contains one or more instances of the distributed index handlers, wherein a first instance of the distributed index handler is configured to split and index data into the two or more instances of the intelligent data operation engine servers, optimize performance by batching data for each intelligent data operation engine server to process, replicate all index commands between the intelligent data operation engine servers, and perform dynamic load distribution for indexing between the intelligent data operation engine servers, where the big data sets have at least a hundred gigabytes of data; and where the distributed index handler is configured to cooperate with the two or more intelligent data operation engine servers to coordinate scalability and performance on the data sets containing both structured and un-structured electronic files represented in the common index.
 5. The intelligence platform of claim 1, wherein the query pipeline is configured to analyze a statistically most rare occurrence of terms or noun phrases present in the query search terms such that a nature of the query can determine a most relevant sub-portions of the common indexed structured data and unstructured data to begin a query search in and send command actions to the two or more intelligent data operation engine servers to focus a most amount of processing power in analyzing the electronic files in the storage repositories containing these most relevant sub-portions, where electronic files with unstructured data includes varieties including as video files, audio files, data from on-line social media sites, emails, text messages, click streams, log files, web-related content, search engine search results, and any combination of these.
 6. The intelligence platform of claim 1, wherein the query pipeline focuses a majority of total processing power of the distributed two or more intelligent data operation engine servers to find electronic files most relevant towards a most rare occurrence of terms, noun phrases, term pairings, and any combinations of these three in the query search terms, and wherein the query pipeline starts a search response on the common index of structured and unstructured data weighing heavily on the most rare of occurrence terms, noun phrases, and term pairings tends to rapidly narrow a volume of electronic files that are potentially most relevant, which need to be analyzed for determining a mathematical number of their relevance in relation to the query search terms so a ranked list of relevant electronic files can be presented back as search results.
 7. The intelligence platform of claim 1, further comprising: where the two or more repositories of stored electronic files are organized by subject matter, including, a database for sales, a database for finance, a database for legal, a database for news, a database for e-mails, and any combination of two or more of these, and wherein the repositories include both repositories for structured data and repositories for unstructured data, and wherein the intelligent platform allows these repositories from distinctly separate sources to work harmoniously as one shared storage of structured and unstructured data by creating conceptual representations in an XML format of each electronic file in these databases such that from the common index any piece data in any form, including unstructured data, can be accessed and queried regardless of its storage location or human language the data in the electronic file is in.
 8. The intelligence platform of claim 1, wherein the first instance of the intelligent data operation engine server includes an executable dynamic reasoning engine that is configured to interact with one or more processors implementing multi-threaded processes of a server to generate the conceptual understandings of the electronic documents, execute queries on the electronic documents stored in the two or more storage repositories, and return a most relevant set of electronic files in a search results to the query.
 9. The intelligence platform of claim 1, wherein conceptual understandings of each of the electronic files stored in the two or more different storage repositories are indexed and organized within the common index, and each conceptual understanding of an electronic file that is organized within the common index has one or more pointers to the actual stored electronic file containing structured data, unstructured data, and any combination of both.
 10. The intelligence platform of claim 1, wherein the action handler is further configured to propagate query actions to the two or more instances of intelligent data operation engine servers to search the common index of electronic files in the two or more repositories, where the action handler is further configured to use the two or more intelligent data operation engine servers as a pool of servers, where a primary intelligent data operation engine server is automatically selected in the pool of servers and the action handler is configured to switch to a secondary intelligent data operation engine server when the primary intelligent data operation engine server fails so that service continues uninterrupted in an event of a failure.
 11. The intelligence platform of claim 1, wherein each instance of the intelligent data operation engine server is configurable by a user to run in either mirroring mode, where intelligent data operation engine servers are exact copies of each other, or run in non-mirroring mode, where two or more of the intelligent data operation engine servers are configured differently than each other and contain different data, in order to intelligently distribute work and load balance amongst the instances of intelligent data operation engine servers, where the action handler and the distributed index handler can be configured by the user to understand which instances are running in mirroring mode and non-mirroring mode to scale to handle massive amounts of queries per unit time or 2) handle fewer amounts of queries with less latency between a query request and a search result response.
 12. The intelligence platform of claim 1, wherein the intelligence platform increases performance of the both the index processing pipeline and query pipeline by using database statistics and life cycle management to determine relevant electronic files and place them into the proper category for indexing, and then later for query optimization providing feedback between the intelligent data operation engine servers while searching different repositories to determine the most relevant electronic files.
 13. The intelligence platform of claim 12, wherein, the first instance of the intelligent data operation engine server is configured to make the determination of which electronic files are most relevant by factoring 1) an amount of relevant documents returning to key terms of a query, 2) a strength rating or percentage relevance of those returned documents to the query, and 3) tracked historic data of a most relevant indexed categories to search for previous similar queries.
 14. The intelligence platform of claim 1, wherein the action handler is configured to send the command action to process a query amongst a cluster of the intelligent data operation engine servers, where the action handler distributes the action commands amongst the two or more IDOL servers to increase a speed with which the command actions are executed, and where the action handler is configured to monitor activity of each intelligent data operation engine server and load balance execution of command actions between the two or more intelligent data operation engine servers.
 15. The intelligence platform of claim 14, where the action handler is also configured to distribute the command actions when a lack of feedback from a particular intelligent data operation engine server occurs to ensure uninterrupted service when any of the intelligent data operation engine servers should fail, where the action handler is user configurable to select which mode that it will run in either mirror mode or non-mirror mode, where when in mirror mode the intelligent data operation engine servers in a cluster of servers that the action handler distributes the command actions to are configured a same way and contain a same set of data, and where when in non-mirror mode the intelligent data operation engine servers in the cluster of servers that the action handler distributes the command actions to are configured differently and contain different sets of data.
 16. The intelligence platform of claim 1, wherein the distributed index handler is configured to create the common index for a cluster of the intelligent data operation engine servers, where the distributed index handler distributes index commands to the two or more intelligent data operation engine servers, so that index commands are executed more quickly and processing time is saved.
 17. The intelligence platform of claim 16, wherein the distributed index handler is also configured to distribute index commands when a lack of feedback from a particular intelligent data operation engine server occurs to ensure uninterrupted service when any of the intelligent data operation engine server should fail, where distributed index handler is configured to receive index information from one or more connectors, where the distributed index handler is user configurable to decide which mode that it will run in either mirror mode or non-mirror mode, where when in mirror mode the distributed index handler distributes all the index data it receives to all of the intelligent data operation engine servers that it is connected to and all of these intelligent data operation engine servers are configured in the same way and contain the same data set, and where when in non-mirror mode the distributed index handler distributes all the index data it receives evenly across selected intelligent data operation engine servers that it is connected to and running the distributed index handler in non-mirror mode assists in servicing an amount of data that is to be indexed that is too large for a single instance of the intelligent data operation engine server to process.
 18. The intelligence platform of claim 1, wherein the first instance of the intelligent data operation engine server is also configured to perform adaptive probabilistic concept caching in which frequently-used concepts are maintained in a memory of the server and query results are returned using multi-tier caching that searches both all of the electronic files stored in repositories of interest while in parallel also searching the frequently-used concepts to find relevant electronic files to a query request, ensuring that a minimum number of operations are performed to provide the relevant electronic files.
 19. The intelligence platform of claim 1, wherein the first instance of the intelligent data operation engine server is configured to perform at least the following actions to create a conceptual understanding of an electronic file: eliminate inconsequential information from any supplied content in the electronic file; generate a set of key terms, wherein the set of key terms includes singular terms, higher order terms, noun phrases, proper names, and any combination of these; assign a frequency of occurrence weight to each main term and higher order combination of terms in each sentence of the electronic file; factor in rarity of each key term and any hierarchical importance given to each key term by the structure of the electronic file; assign a set of XML tags into the conceptual understanding of that electronic file while maintaining any original XML tags; store one or more conceptual understandings of that electronic file; modify the assigned weight values over time as the conceptual understanding is used and matches to queries are made; and store the conceptual understanding with the modified weight values.
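By way of illustration only, the first weighting actions recited in claim 19 (eliminating inconsequential information, generating key terms, assigning a frequency of occurrence weight, and factoring in rarity) might be sketched as follows. The claim does not specify a formula, so this sketch assumes a conventional smoothed term-frequency and inverse-document-frequency product as a stand-in. The stop-word list and the names `STOP_WORDS`, `key_terms`, and `concept_weights` are hypothetical, and only singular terms are handled; higher order terms, hierarchical importance, XML tagging, and weight modification over time are not modeled.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Hypothetical stop-word list standing in for "inconsequential information".
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def key_terms(text: str) -> List[str]:
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def concept_weights(doc: str, corpus: List[str]) -> Dict[str, float]:
    """Weight each key term by its frequency in doc and its rarity in corpus."""
    counts = Counter(key_terms(doc))
    total = sum(counts.values())
    corpus_terms = [set(key_terms(other)) for other in corpus]
    weights: Dict[str, float] = {}
    for term, count in counts.items():
        # Number of corpus documents that contain the term.
        containing = sum(1 for terms in corpus_terms if term in terms)
        # Smoothed inverse document frequency: rarer terms score higher.
        rarity = math.log((1 + len(corpus)) / (1 + containing)) + 1.0
        weights[term] = (count / total) * rarity
    return weights
```

Under these assumptions, a term that appears in few files of the corpus receives a higher weight than an equally frequent term that appears in many files.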
 20. A method for processing big data sets including unstructured data and structured data in a network of modular and distributed servers, comprising: assigning weights to 1) terms found in a content of an electronic file, 2) a corpus of the electronic file, 3) a document level tier of the electronic file, and 4) any combination of the three; using idea distancing between similar concepts stored in a relational database in order to form a conceptual understanding of content in each electronic file; using both the statistical algorithms and the idea distancing in order to form a conceptual understanding of content in each electronic file; cooperating with a distributed index handler to index the conceptual understanding of the electronic file, where a common index for an aggregate amount of electronic files in the big data sets includes at least a conceptual understanding of a first electronic file that contains structured data and a conceptual understanding of a second electronic file that contains unstructured data; communicating conceptual understandings of the electronic files organized in the common index across two or more intelligent data operation engine servers; storing the aggregate amount of electronic files in the big data sets; supporting a distribution of command actions in a common protocol to the two or more intelligent data operation engine servers, where a query pipeline and the distributed index handler cooperate with the two or more intelligent data operation engine servers; and placing instances of the intelligent data operation engine servers to run in 1) mirroring mode to scale to process massive amounts of queries per unit time because the instances are replicated copies of each other and the processing of the queries is load balanced between the replicated instances of the intelligent data operation engine servers, or 2) run in non-mirroring mode to process queries with less latency between a query request and a search result response because the instances are 1) configured differently or 2) work on different sets of data causing less latency between the query request and the search result response, where an operator of the distributed server system has an option to place the instances of the intelligent data operation engine servers to run in mirroring mode or non-mirroring mode.